Linear MMSE-Optimal Turbo Equalization 

Using Context Trees 

Nargiz Kalantarova, Kyeongyeon Kim, Suleyman S. Kozat, Senior Member, IEEE, and 

Andrew C. Singer, Fellow, IEEE 



Abstract 

Formulations of the turbo equalization approach to iterative equalization and decoding vary greatly
when the channel is either partially or completely unknown. Maximum a posteriori probability
(MAP) and minimum mean square error (MMSE) approaches leverage channel knowledge to make
explicit use of soft information (priors over the transmitted data bits) in a manner that is distinctly
nonlinear, appearing either in a trellis formulation (MAP) or inside an inverted matrix (MMSE). To date,
nearly all adaptive turbo equalization methods either estimate the channel or use a direct adaptation
equalizer in which estimates of the transmitted data are formed from an expressly linear function of the
received data and soft information, with this latter formulation being most common. We study a class of
direct adaptation turbo equalizers that are both adaptive and nonlinear functions of the soft information
from the decoder. We introduce piecewise linear models based on context trees that can adaptively
approximate the nonlinear dependence of the equalizer on the soft information such that it can choose
both the partition regions as well as the locally linear equalizer coefficients in each region independently,
with computational complexity that remains of the order of a traditional direct adaptive linear equalizer.
This approach is guaranteed to asymptotically achieve the performance of the best piecewise linear
equalizer and we quantify the MSE performance of the resulting algorithm and the convergence of its
MSE to that of the linear minimum MSE estimator as the depth of the context tree and the data length
increase.

Index Terms

Turbo equalization, piecewise linear, nonlinear equalization, context tree, decision feedback.
EDICS Category: MLR-APPL, MLR-SLER, ASP-APPL.

Andrew C. Singer (acsinger@illinois.edu) is with the Electrical and Computer Engineering Department at University
of Illinois at Urbana-Champaign. Suleyman S. Kozat and Nargiz Kalantarova ({skozat,nkalantarova}@ku.edu.tr) are with
the Electrical Engineering Department at Koc University, Istanbul, tel: +902123381684, +902123381490. Kim Kyeongyeon
(adrianakky@gmail.com) is with Samsung Electronics, Gyeonggi-do, Republic of Korea.

February 27, 2013 DRAFT

I. Introduction 

Iterative equalization and decoding methods, or so-called turbo equalization [1]-[3], have become
increasingly popular methods for leveraging the power of forward error correction to enhance the
performance of digital communication systems in which intersymbol interference or multiple access interference
is present. Given full channel knowledge, maximum a posteriori probability (MAP) equalization and
decoding give rise to an elegant manner in which the equalization and decoding problems can be
(approximately) jointly resolved [1]. For large signal constellations or when the channel has a large delay spread
resulting in substantial intersymbol interference, this approach becomes computationally prohibitive and 
lower complexity linear equalization strategies are often employed [4]-[6]. Computational complexity 
issues are also exacerbated by the use of multi-input/multi-output (MIMO) transmission strategies. It is 
important to note that MAP and MMSE formulations of the equalization component in such iterative 
receivers make explicit use of soft information from the decoder that is a nonlinear function of both the 
channel response and the soft information [5]. In a MAP receiver, soft information is used to weight 
branch metrics in the receiver trellis [7]. In an MMSE receiver, this soft information is used in the 
(recursive) computation of the filter coefficients and appears inside of a matrix that is inverted [5]. 

In practice, most communication systems lack precise channel knowledge and must make use of pilots 
or other means to estimate and track the channel if the MAP or MMSE formulations of turbo equalization 
are to be used [5], [7]. Increasingly, however, receivers based on direct adaptation methods are used for 
the equalization component, due to their attractive computational complexity [4], [8], [9]. Specifically, the 
channel response is neither needed nor estimated for direct adaptation equalizers, since the transmitted data 
symbols are directly estimated based on the signals received. This is often accomplished with a linear or 
decision feedback structure that has linear complexity in the channel memory, as opposed to the quadratic 
complexity of the MMSE formulation, and is invariant to the constellation size [8]. A MAP receiver not 
only needs a channel estimate, but also has complexity that is exponential in the channel memory, 
where the base of the exponent is the transmit constellation size [7]. For example, underwater acoustic 
communications links often have a delay spread in excess of several tens to hundreds of symbol periods, 
make use of 4 or 16 QAM signal constellations, and have multiple transmitters and receive hydrophones 
[9], [10]. In our experience, for such underwater acoustic channels, MAP-based turbo equalization is
infeasible and MMSE-based methods are impractical for all but the most benign channel conditions [9].
As such, direct-adaptation receivers that form an estimate of the transmitted symbols as a linear function 
of the received data, past decided symbols, and soft information from the decoder have emerged as the 
most pragmatic solution. Least-mean square (LMS)-based receivers are used in practice to estimate and 
track the filter coefficients in these soft-input/soft-output decision feedback equalizer structures, which 
are often multi-channel receivers for both SIMO and MIMO transmissions [4], [6], [8]. 

While such linear complexity receivers have addressed the computational complexity issues that make 
MAP and MMSE formulations unattractive or infeasible, they have also unduly restricted the potential 
benefit of incorporating soft information into the equalizer. Although such adaptive linear methods may 
converge to their "optimal", i.e., Wiener solution, they usually deliver inferior performance compared to 
a linear MMSE turbo receiver [11], since the Wiener solution for this stationarized problem replaces the
time-varying soft information by its time average [5], [12]. It is inherent in the structure of such adaptive 
approaches that an implicit assumption is made that the random process governing the received data and 
that of the soft-information sequence are both mean ergodic so that ensemble averages associated with 
the signal and soft information can be estimated with time averages. The primary source of performance 
loss of these adaptive algorithms is due to their implicit use of the log likelihood ratio (LLR) information 
from the decoder as a stationary soft decision sequence [11], whereas a linear MMSE turbo equalizer
considers this LLR information as nonstationary a priori statistics over the transmitted symbols [5]. 

Indeed, one of the strengths of the linear MMSE turbo equalizer lies in its ability to employ a distinctly 
different linear equalizer for each transmitted symbol [5], [6]. This arises from the time-varying nature 
of the local soft information available to the receiver from the decoder. Hence, even if the channel 
response were known and fixed (i.e., time-invariant), the MMSE-optimal linear turbo equalizer corresponds
to a set of linear filter coefficients that are different for each and every transmitted symbol [13], [5].
This is due to the presence of the soft information inside an inverted matrix that is used to construct
the MMSE-optimal equalizer coefficients. As a result, a time-invariant channel will still give rise to a
recursive formulation of the equalizer coefficients that requires quadratic complexity per output symbol.
As an example, in Fig. 1 we plot, for a time-invariant channel, the time-varying filter coefficients of the
MMSE linear turbo equalizer, along with the filter coefficients of an LMS-based, direct adaptation turbo 
equalizer that has converged to its time invariant solution. This behavior is actually manifested due to 
the nonlinear relationship between the soft information and the MMSE filter coefficients. 

In this paper, we explore a class of equalizers that maintain the linear complexity adaptation of linear, 
direct adaptation equalizers [8], but attempt to circumvent the loss of this nonlinear dependence of the 




[Figure 1: 6th and 7th filter taps of the MMSE TEQ and of the LMS TEQ, plotted against time [symbols].]

Fig. 1. An example of time varying filter coefficients of an MMSE turbo equalizer (TEQ) and steady state filter coefficients
of an LMS turbo equalizer (TEQ) in a time invariant ISI channel [0.227, 0.46, 0.688, 0.46, 0.227] at the second turbo iteration.
(SNR = 10 dB, feedforward filter length = 15, step size = 0.001, BPSK, random interleaver and 1/2 rate convolutional code with
constraint length of 3 are used.)



MMSE optimal equalizer on the soft information from the decoder [5]. Specifically, we investigate an 
adaptive, piecewise linear model based on context trees [14] that partition the space of soft information 
from the decoder, such that locally linear (in soft information space) models may be used. However,
instead of using a fixed piecewise linear equalizer, the nonlinear algorithm we introduce can adaptively
choose the partitioning of the space of soft information as well as the locally linear equalizer coefficients
in each region with computational complexity that remains on the order of a traditional adaptive linear
equalizer [7]. The resulting algorithm can therefore successfully navigate the short-data record regime,
by placing more emphasis on lower-order models, while achieving the ultimate precision of higher order
models as the data record grows to accommodate them. The introduced equalizer can be shown to
asymptotically (and uniformly) achieve the performance of the best piecewise linear equalizer that could
have been constructed, given full knowledge of the channel and the received data sequence in advance.
Furthermore, the mean square error (MSE) of this equalizer is shown to converge to that of the
minimum MSE (MMSE) estimator (which is a nonlinear function of the soft information) as the depth 
of the context tree and data length increase. 

Context trees and context tree weighting are extensively used in data compression [14], coding, and data
prediction [15]-[17]. In the context of source coding and universal probability assignment, the context
tree weighting method is mainly used to calculate a weighted mixture of probabilities generated by the 
piecewise Markov models represented on the tree [14]. In nonlinear prediction, context trees are used 
to represent piecewise linear models by partitioning the space of past regressors [15], [17], specifically 
for labeling the past observations. Note that although we use the notion of context trees for nonlinear 
modeling as in [14], [16]-[19], our results and usage of context trees differ from [14], [16]-[18] in a 
number of important ways. The "context" used in our context trees corresponds to a spatial parsing of the
soft information space, rather than the temporal parsing as studied in [14], [16], [17]. In addition, the 
context trees used here are specifically used to represent the nonlinear dependency of equalizer coefficients 
on the soft information. In this sense, as an example, the time adaptation is mainly (in addition to learning) 
due to the time variation of the soft information coming from the decoder, unlike the time dependent 
learning in [18]. Hence, here, we explicitly calculate the MSE performance and quantify the difference
between the MSE of the context tree algorithm and the MSE of the linear MMSE equalizer, which is 
the main objective. 

The paper is organized as follows. In Section II, we introduce the basic system description and provide 
the objective of the paper. The nonlinear equalizers studied are introduced in Section III. In Section III-A,
we first introduce a partitioned linear turbo equalization algorithm, where the partitioning of the regions is 
fixed. We continue in Section III-B with the turbo equalization framework using context trees, where the 
corresponding algorithm with the guaranteed performance bounds is introduced. Furthermore, we provide 
the MSE performance of all the algorithms introduced and compare them to the MSE performance of the 
linear MMSE equalizer. The paper concludes with numerical examples demonstrating the performance 
gains and the learning mechanism of the algorithm. 

II. System Description 
Throughout the paper, all vectors are column vectors and represented by boldface lowercase letters.
Matrices are represented by boldface uppercase letters. Given a vector $\mathbf{x}$, $\|\mathbf{x}\| = \sqrt{\mathbf{x}^H\mathbf{x}}$ is the $\ell_2$-norm,
where $\mathbf{x}^H$ is the conjugate transpose, $\mathbf{x}^T$ is the ordinary transpose and $\mathbf{x}^*$ is the complex conjugate. For
a random variable $x$ (or a vector $\mathbf{x}$), $E[x] = \bar{x}$ (or $E[\mathbf{x}] = \bar{\mathbf{x}}$) is the expectation. For a vector $\mathbf{x}$, $\mathrm{diag}(\mathbf{x})$
is a diagonal matrix constructed from the entries of $\mathbf{x}$ and $x(i)$ is the $i$th entry of the vector. For a square
matrix $\mathbf{M}$, $\mathrm{tr}(\mathbf{M})$ is the trace. Sequences are represented using curly brackets, e.g., $\{x(t)\}$. $\bigcup_{i=1}^{N} A_i$
denotes the union of the sets $A_i$, where $i = 1, \ldots, N$. The $\mathrm{vec}(\cdot)$ operator stacks the columns of a matrix
of dimension $m \times n$ into an $mn \times 1$ column vector [20].

The block diagram of the system we consider with a linear turbo equalizer is shown in Fig. 2. The
information bits $\{b(t)\}$ are first encoded using an error correcting code (ECC) and then interleaved to
generate $\{c(t)\}$. The interleaved code bits $\{c(t)\}$ are transmitted after symbol mapping, e.g., $x(t) = (-1)^{c(t)}$
for BPSK signaling, through a baseband discrete-time channel with a finite-length impulse
response $\{h(t)\}$, $t = 0, 1, \ldots, M-1$, represented by $\mathbf{h} \triangleq [h(M-1), \ldots, h(0)]^T$. The communication
channel $\mathbf{h}$ is unknown. The transmitted signal $\{x(t)\}$ is assumed to be uncorrelated due to the interleaver.
The received signal $y(t)$ is given by



$$y(t) \triangleq \sum_{k=0}^{M-1} h(k)\,x(t-k) + n(t),$$

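where $\{n(t)\}$ is the additive complex white Gaussian noise with zero mean and circular symmetric
variance $\sigma_n^2$. If a linear equalizer is used to reduce the ISI, then the estimate of the desired data $x(t)$
using the received data $y(t)$ is given by

$$\hat{x}(t) = \mathbf{w}^T(t)[\mathbf{y}(t) - \bar{\mathbf{y}}(t)] + \bar{x}(t),$$

where $\mathbf{w}(t) = [w(t, N_2), \ldots, w(t, -N_1)]^T$ is a length $N = N_1 + N_2 + 1$ linear equalizer and
$\mathbf{y}(t) \triangleq [y(t - N_2), \ldots, y(t + N_1)]^T$; note that we use negative indices with a slight abuse of notation. The received
data vector is given by $\mathbf{y}(t) = \mathbf{H}\mathbf{x}(t) + \mathbf{n}(t)$, where $\mathbf{x}(t) \triangleq [x(t - M - N_2 + 1), \ldots, x(t + N_1)]^T$
and $\mathbf{H} \in \mathbb{C}^{N \times (N+M-1)}$,

$$\mathbf{H} \triangleq \begin{bmatrix}
h(M-1) & h(M-2) & \cdots & h(0) & 0 & \cdots & 0 \\
0 & h(M-1) & h(M-2) & \cdots & h(0) & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & \cdots & 0 & h(M-1) & h(M-2) & \cdots & h(0)
\end{bmatrix},$$

is the convolution matrix corresponding to $\mathbf{h} = [h(M-1), \ldots, h(0)]^T$. Since the noise has zero mean,
$\bar{\mathbf{y}}(t) = \mathbf{H}\bar{\mathbf{x}}(t)$, and the estimate of $x(t)$ can be written as

$$\hat{x}(t) = \mathbf{w}^T(t)[\mathbf{y}(t) - \mathbf{H}\bar{\mathbf{x}}(t)] + \bar{x}(t), \tag{1}$$

given that the mean of the transmitted data is known.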
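The signal model above can be checked numerically with a minimal sketch. It assumes real-valued channel taps (those quoted in the caption of Fig. 1), the filter lengths quoted in the caption of Fig. 3, and omits the noise term; the helper name `convolution_matrix` and the chosen time index are illustrative only.

```python
import numpy as np

def convolution_matrix(h, N):
    """Return the N x (N + M - 1) matrix whose rows hold [h(M-1), ..., h(0)],
    shifted right by one column per row."""
    h = np.asarray(h, dtype=float)
    M = len(h)
    H = np.zeros((N, N + M - 1))
    for row in range(N):
        H[row, row:row + M] = h[::-1]
    return H

h = np.array([0.227, 0.46, 0.688, 0.46, 0.227])   # h(0), ..., h(M-1)
M, N1, N2 = len(h), 9, 5
N = N1 + N2 + 1
H = convolution_matrix(h, N)

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=200)             # BPSK symbols x(0), ..., x(199)
y = np.convolve(x, h)[:len(x)]                    # y(t) = sum_k h(k) x(t - k), noise omitted

t = 100                                           # any t with t - M - N2 + 1 >= 0 and t + N1 < len(x)
x_vec = x[t - M - N2 + 1 : t + N1 + 1]            # [x(t - M - N2 + 1), ..., x(t + N1)]
y_vec = y[t - N2 : t + N1 + 1]                    # [y(t - N2), ..., y(t + N1)]
max_mismatch = np.max(np.abs(H @ x_vec - y_vec))  # zero up to rounding if y = Hx holds
```

With this indexing, entry `M + N2 - 1` (0-based) of `x_vec` is the current symbol $x(t)$, which is why the $(M + N_2)$th column of $\mathbf{H}$ plays a special role in the MMSE filter.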

However, in turbo equalization, instead of only using an equalizer, equalization and decoding are
performed jointly and iteratively at the receiver of Fig. 2. The equalizer computes the a posteriori information using
the received signal, transmitted signal estimate, channel convolution matrix (if known) and a priori
probability of the transmitted data. After subtracting the a priori information, $LLR_a^E$, and de-interleaving
the extrinsic information $LLR_e^E$, a soft input soft output (SISO) channel decoder computes the extrinsic
information $LLR_e^D$ on coded bits, which are fed back to the linear equalizer as a priori information
$LLR_a^E$ after interleaving.

Fig. 2. Block diagram for a bit interleaved coded modulation transmitter and receiver with a linear turbo equalizer.

As the linear equalizer, if one uses the linear MMSE equalizer for $\mathbf{w}(t)$, the mean and the variance of
$x(t)$ are required to calculate $\mathbf{w}(t)$ and $\hat{x}(t)$. These quantities are computed using the a priori information
from the decoder as $\bar{x}(t) \triangleq E[x(t) : \{LLR_a^E(t)\}]$¹ and $q(t) \triangleq E[x^2(t) : \{LLR_a^E(t)\}] - \bar{x}^2(t)$. As
an example, for BPSK signaling, the mean and variance are given as $\bar{x}(t) = \tanh(LLR_a^E(t)/2)$ and
$q(t) = 1 - |\bar{x}(t)|^2$. However, to remove the dependency of $\hat{x}(t)$ on $LLR_a^E(t)$ due to using $\bar{x}(t)$ and $q(t)$ in
(1), one can set $LLR_a^E(t) = 0$ while computing $\hat{x}(t)$, yielding $\bar{x}(t) = 0$ and $q(t) = 1$ [5]. Then, the linear
MMSE equalizer is given by

$$\mathbf{w}(t) = [\mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r\mathbf{Q}(t)\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}]^T, \tag{2}$$

where $\mathbf{Q}(t) \triangleq E[(\mathbf{x}(t) - \bar{\mathbf{x}}(t))(\mathbf{x}(t) - \bar{\mathbf{x}}(t))^H : \{LLR_a^E(t)\}]$ is a diagonal matrix (due to the uncorrelatedness
assumption on $x(t)$) with diagonal entries $\mathbf{Q}(t) = \mathrm{diag}(\mathbf{q}(t))$,
$\mathbf{q}(t) \triangleq [q(t - M - N_2 + 1), \ldots, q(t-1), q(t+1), \ldots, q(t + N_1)]^T$, $\mathbf{v} \in \mathbb{C}^N$ is the $(M + N_2)$th column of $\mathbf{H}$, and $\mathbf{H}_r$ is the reduced form of $\mathbf{H}$
where the $(M + N_2)$th column is removed. The linear MMSE equalizer in (1) yields

$$\hat{x}(t) = \mathbf{w}^T(t)[\mathbf{y}(t) - \mathbf{H}\bar{\mathbf{x}}(t)]$$
$$\phantom{\hat{x}(t)} = \mathbf{w}^T(t)\mathbf{y}(t) - \mathbf{f}^T(t)\bar{\mathbf{x}}(t), \tag{3}$$

where $\mathbf{f}(t) \triangleq \mathbf{H}^T\mathbf{w}(t)$. In this sense the linear MMSE equalizer can be decomposed into a feedforward
filter $\mathbf{w}(t)$ processing $\mathbf{y}(t)$ and a feedback filter $\mathbf{f}(t)$ processing $\bar{\mathbf{x}}(t)$.

¹ With a slight abuse of notation, the expression $E[x(t) : y(t)]$ is interpreted here and in the sequel as the expectation of $x(t)$
with respect to the prior distribution $y(t)$.
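As a rough illustration of (2) and (3), the sketch below computes the BPSK prior statistics from a set of LLRs and then forms the feedforward and feedback filters. The noise variance, the synthetic LLR values, and the function names `bpsk_prior_stats` and `mmse_filters` are assumptions made for this example; a known real-valued channel is assumed, and the received vector is arbitrary because it is used only to compare the two forms of (3).

```python
import numpy as np

def bpsk_prior_stats(llr):
    """Mean and variance of BPSK symbols given their a priori LLRs."""
    x_bar = np.tanh(llr / 2.0)
    return x_bar, 1.0 - x_bar ** 2

def mmse_filters(H, q_reduced, sigma2, cur):
    """Filters of (2)-(3); q_reduced excludes the current symbol, cur is the 0-based index of v in H."""
    v = H[:, cur]
    H_r = np.delete(H, cur, axis=1)
    A = sigma2 * np.eye(H.shape[0]) + (H_r * q_reduced) @ H_r.conj().T + np.outer(v, v.conj())
    w = np.linalg.solve(A, v).conj()      # w^T = v^H A^{-1}, using that A is Hermitian
    f = H.T @ w                           # feedback filter f = H^T w
    return w, f

h = np.array([0.227, 0.46, 0.688, 0.46, 0.227])
M, N1, N2, sigma2 = len(h), 9, 5, 0.1     # sigma2 is an illustrative value
N = N1 + N2 + 1
cur = M + N2 - 1
H = np.array([np.concatenate([np.zeros(r), h[::-1], np.zeros(N - 1 - r)]) for r in range(N)])

rng = np.random.default_rng(1)
llr = rng.normal(0.0, 2.0, size=N + M - 1)   # synthetic LLRs, not decoder output
llr[cur] = 0.0                               # LLR of the current symbol is set to zero
x_bar, q_full = bpsk_prior_stats(llr)
w, f = mmse_filters(H, np.delete(q_full, cur), sigma2, cur)

y = rng.normal(size=N)                       # arbitrary received vector
x_hat_a = w @ (y - H @ x_bar)                # first line of (3)
x_hat_b = w @ y - f @ x_bar                  # second line of (3)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical choice of this sketch; the two estimates agree up to rounding.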

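Remark 1: Both the linear MMSE feedforward and feedback filters are highly nonlinear functions of $\mathbf{q}$,
i.e.,

$$\mathbf{w} = \mathcal{W}(\mathbf{q}) \triangleq [\mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r\,\mathrm{diag}(\mathbf{q})\,\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}]^T, \tag{4}$$
$$\mathbf{f} = \mathcal{F}(\mathbf{q}) \triangleq \mathbf{H}^T\mathcal{W}(\mathbf{q}),$$

where $\mathcal{W}(\cdot), \mathcal{F}(\cdot) : \mathbb{C}^{N+M-2} \to \mathbb{C}^{N}$. We point out that the time variation in (2) is due to the time variation
in the vector of variances $\mathbf{q}$ (assuming $\mathbf{h}$ is time-invariant).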

To learn the corresponding feedforward and feedback filters that are highly nonlinear functions of q, 
we use piecewise linear models based on vector quantization and context trees in the next section. The 
space spanned by q is partitioned into disjoint regions and a separate linear model is trained for each 
region to approximate the functions $\mathcal{W}(\cdot)$ and $\mathcal{F}(\cdot)$ using piecewise linear models.

Note that if the channel is not known or estimated, one can directly train the corresponding equalizers 
in (3) using adaptive algorithms such as in [4], [11] without channel estimation or piecewise constant 
partitioning as done in this paper. In this case, one directly applies the adaptive algorithms to feedforward 
and feedback filters using the received data $\{\mathbf{y}(t)\}$ and the mean vector $\bar{\mathbf{x}}(t)$ as feedback without
considering the soft decisions as a priori probabilities. Assuming stationarity of $\bar{\mathbf{x}}(t)$, such adaptive
feedforward and feedback filters have Wiener solutions [11]

$$\mathbf{w} = [\mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r E[\mathbf{Q}(t)]\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}]^T, \tag{5}$$
$$\mathbf{f} = \mathbf{H}^T\mathbf{w}.$$

Fig. 3. The EXIT chart for the exact MMSE turbo equalizer, the LMS turbo equalizer and their trajectory in a time invariant ISI
channel $[0.227, 0.46, 0.688, 0.46, 0.227]^T$. Here, we have SNR = 10 dB, $N_1 = 9$, $N_2 = 5$, feedback filter length $N + M - 1 = 19$,
data block length = 8192, training data length = 2048, $\mu = 0.001$, BPSK signaling, random interleaver and 1/2 rate
convolutional code with constraint length of 3.

Note that assuming stationarity of the log likelihood ratios [11], $E[\mathbf{Q}(t)]$ is constant in time, i.e., there is no
time index for $\mathbf{w}$, $\mathbf{f}$ in (5). When PSK signaling is used such that $E[|x(t)|^2 : \{LLR_a^E(t)\}] = 1$, the filter
coefficient vector in (5) is equal to the coefficient vector of the MMSE equalizer in [5] with the time
averaged soft information, i.e., time average instead of an ensemble mean. Comparing (5) and (2), we
observe that using the time averaged soft information does not degrade equalizer performance in the no 
a priori information, i.e., $\mathbf{Q}(t) = \mathbf{I}$, or perfect a priori information, i.e., $\mathbf{Q}(t) = \mathbf{0}$, cases. In addition, the
performance degradation in moderate ISI channels is often small [5] when perfect channel knowledge is
used. However, the performance gap increases in ISI channels that are more difficult to equalize, even in
the high SNR region [11], since the effect of the filter time variation increases in the high SNR region.
A comparison of an exact MMSE turbo equalizer with/without channel estimation error and an MMSE
turbo equalizer with the time averaged soft variances (i.e., when an ideal filter for the converged adaptive
turbo equalizer is used) via the EXIT chart [21] is given in Fig. 3. As the adaptive turbo equalizer,
a decision directed (DD) LMS turbo equalizer is used in the data transmission period, while LMS is 
run on the received signals for the first turbo iteration and on the received signals and training symbols 
for the rest of turbo iterations in the training period. Note that the tentative decisions can be taken as 
the hard decisions at the output of the linear equalizer or as the soft decisions from the total LLRs at 
the output of the decoder. When we consider nonideality, both the MMSE turbo equalizer with channel
estimation error and the DD-LMS turbo equalizer lose mutual information at the equalizer in the first few turbo
iterations². Even though there is a loss in mutual information at the equalizer in the first and second
turbo iterations due to using decision directed data or channel estimation error, both algorithms follow
their ideal performance at the end (i.e., the DD-LMS turbo equalizer can achieve the performance of
the time-averaged MMSE turbo equalizer as the decision data gets more reliable). However, there is still
a gap in achieved mutual information between the exact MMSE turbo equalizer and the LMS adaptive
turbo equalizer except for the no a priori information and perfect a priori information cases. Note that
such a gap can make an adaptive turbo equalizer become trapped in the lower SNR region while an MMSE
turbo equalizer converges as the number of turbo iterations increases.
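The gap between the exact filter of (2) and the time-averaged filter of (5) can be illustrated with a small numerical sketch. It assumes a known real-valued channel, unit-power symbols, an illustrative noise variance, and a synthetic sequence of variance vectors drawn uniformly from $[0, 1]$ rather than produced by a decoder; the function names are hypothetical. The MSE of a fixed filter is evaluated directly from the signal model of Section II.

```python
import numpy as np

def mmse_filter(v, H_r, q, sigma2):
    """Feedforward filter of (2) (or of (5) when q is a time average)."""
    A = sigma2 * np.eye(len(v)) + (H_r * q) @ H_r.conj().T + np.outer(v, v.conj())
    return np.linalg.solve(A, v).conj()

def mse(w, v, H_r, q, sigma2):
    """MSE of a fixed filter w when the true symbol variances are q."""
    bias = abs(1.0 - w @ v) ** 2
    interference = np.real(w @ (H_r * q) @ H_r.conj().T @ w.conj())
    noise = sigma2 * np.real(w @ w.conj())
    return bias + interference + noise

h = np.array([0.227, 0.46, 0.688, 0.46, 0.227])
M, N1, N2, sigma2 = len(h), 9, 5, 0.1
N = N1 + N2 + 1
cur = M + N2 - 1
H = np.array([np.concatenate([np.zeros(r), h[::-1], np.zeros(N - 1 - r)]) for r in range(N)])
v, H_r = H[:, cur], np.delete(H, cur, axis=1)

rng = np.random.default_rng(2)
q_seq = rng.uniform(0.0, 1.0, size=(500, N + M - 2))      # synthetic q(t), one row per symbol
w_avg = mmse_filter(v, H_r, q_seq.mean(axis=0), sigma2)   # single filter from time-averaged variances

mse_exact = np.array([mse(mmse_filter(v, H_r, q, sigma2), v, H_r, q, sigma2) for q in q_seq])
mse_avg = np.array([mse(w_avg, v, H_r, q, sigma2) for q in q_seq])
```

Because the exact filter minimizes the MSE for each $\mathbf{q}(t)$, `mse_avg` can never fall below `mse_exact` at any time index; how large the gap is depends on the channel and on the spread of the soft information.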

To remedy this, in the next section, we introduce piecewise linear equalizers to approximate $\mathcal{W}(\cdot)$
and $\mathcal{F}(\cdot)$. We first discuss adaptive piecewise linear equalizers with a fixed partition of $\mathbb{C}^{N+M-2}$ (where
$\mathbf{q} \in \mathbb{C}^{N+M-2}$). Then, we introduce adaptive piecewise linear equalizers using context trees that can learn
the best partition from a large class of possible partitions of $\mathbb{C}^{N+M-2}$.

III. Nonlinear Turbo Equalization Using Piecewise Linear Models 
A. Piecewise Linear Turbo Equalization with Fixed Partitioning 

In this section, we divide the space spanned by $\mathbf{q} \in [0,1]^{N+M-2}$ (assuming BPSK signaling for
notational simplicity) into disjoint regions $V_k$, e.g., $[0,1]^{N+M-2} = \bigcup_{k=1}^{K} V_k$, for some $K$ and train an
independent linear equalizer in each region $V_k$ to yield a final piecewise linear equalizer to approximate
$\mathbf{w} = \mathcal{W}(\mathbf{q})$ and $\mathbf{f} = \mathcal{F}(\mathbf{q})$. As an example, given $K$ such regions, suppose a time varying linear equalizer
is assigned to each region as $\mathbf{w}_k(t)$, $\mathbf{f}_k(t)$, $k = 1, \ldots, K$, such that at each time $t$, if $\mathbf{q}(t) \in V_k$, the
estimate of the transmitted signal is given as

$$\hat{x}_k(t) \triangleq \mathbf{w}_k^T(t)\mathbf{y}(t) - \mathbf{f}_k^T(t)\bar{\mathbf{x}}(t), \tag{6}$$
$$\hat{x}(t) = \hat{x}_k(t).$$

We emphasize that the time variations in $\mathbf{w}_k(t)$ and $\mathbf{f}_k(t)$ in (6) are not due to the time variation in $\mathbf{q}(t)$,
unlike (3). The filters $\mathbf{w}_k(t)$ and $\mathbf{f}_k(t)$ are time varying since they are produced by adaptive algorithms
sequentially learning the corresponding functions $\mathcal{W}(\cdot)$ and $\mathcal{F}(\cdot)$ in region $V_k$. Note that if $K$ is large and
the regions are dense such that $\mathcal{W}(\mathbf{q})$ (and $\mathcal{F}(\mathbf{q})$) can be considered constant in $V_k$, say equal to $\mathcal{W}(\mathbf{q}_k)$
for some $\mathbf{q}_k$ in region $V_k$, then if the adaptation method used in each region converges successfully, this
yields $\mathbf{w}_k(t) \to \mathcal{W}(\mathbf{q}_k)$ and $\mathbf{f}_k(t) \to \mathcal{F}(\mathbf{q}_k)$ as $t \to \infty$. Hence, if these regions are dense and there is
enough data to learn the corresponding models in each region, then this piecewise model can approximate
any smoothly varying $\mathcal{W}(\mathbf{q})$ and $\mathcal{F}(\mathbf{q})$ [22].

² This performance loss in the first few turbo iterations can cause the iterative process to fail in the low SNR region.

In order to choose the corresponding regions $V_1, \ldots, V_K$, we apply a vector quantization (VQ) algorithm
to the sequence of $\{\mathbf{q}(t)\}$, such as the LBG VQ algorithm [23]. If a VQ algorithm with $K$ regions and
Euclidean distance is used for clustering, then the centroids and the corresponding regions are defined as

$$\tilde{\mathbf{q}}_k \triangleq \frac{\sum_{t:\,\mathbf{q}(t) \in V_k} \mathbf{q}(t)}{\sum_{t:\,\mathbf{q}(t) \in V_k} 1}, \tag{7}$$

$$V_k \triangleq \{\mathbf{q} : \|\mathbf{q} - \tilde{\mathbf{q}}_k\| \le \|\mathbf{q} - \tilde{\mathbf{q}}_i\|,\ i = 1, \ldots, K,\ i \ne k\}, \tag{8}$$

where $\mathbf{q}(t) \triangleq [q(t - M - N_2 + 1), \ldots, q(t-1), q(t+1), \ldots, q(t + N_1)]^T$, and $\mathbf{q} \in \mathbb{C}^{N+M-2}$. We
emphasize that we use a VQ algorithm on $\{\mathbf{q}(t)\}$ to construct the corresponding partitioned regions in
order to concentrate on $\mathbf{q}$ vectors that are in $\{\mathbf{q}(t)\}$ since $\mathcal{W}(\cdot)$ and $\mathcal{F}(\cdot)$ should only be learned around
$\mathbf{q} \in \{\mathbf{q}(t)\}$, not for all $\mathbb{C}^{N+M-2}$. After the regions are constructed using the VQ algorithm and the
corresponding filters in each region are trained with an appropriate adaptive method, the estimate of $x(t)$
at each time $t$ is given as $\hat{x}(t) = \hat{x}_i(t)$ if $i = \arg\min_k \|\mathbf{q}(t) - \tilde{\mathbf{q}}_k\|$.
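The centroid and nearest-region rules in (7) and (8) can be sketched as follows. For brevity this uses plain Lloyd iterations with random initialization as a stand-in for the LBG algorithm of [23] (which adds a codebook-splitting initialization); the function names and the synthetic data are assumptions of the example.

```python
import numpy as np

def assign_region(Q, centroids):
    """Index of the nearest centroid (Euclidean distance) for each row of Q, as in (8)."""
    dists = np.linalg.norm(Q[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_vq(Q, K, n_iter=50, seed=0):
    """Lloyd iterations: alternate nearest-centroid assignment and the centroid update of (7)."""
    rng = np.random.default_rng(seed)
    centroids = Q[rng.choice(len(Q), size=K, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_region(Q, centroids)
        for k in range(K):
            members = Q[labels == k]
            if len(members) > 0:              # keep the old centroid if a region is empty
                centroids[k] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(4)
Q = rng.uniform(0.0, 1.0, size=(400, 18))     # synthetic q(t) vectors of length N + M - 2 = 18
centroids = lloyd_vq(Q, K=4)
labels = assign_region(Q, centroids)          # region index i = argmin_k ||q(t) - q_k|| per symbol
```

At run time only `assign_region` is needed per symbol, so the cost of selecting the region grows linearly with $K$ and with the length of $\mathbf{q}(t)$.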

In Fig. 4, we introduce such a sequential piecewise linear equalizer that uses the LMS update to train 
its equalizer filters. Here, $\mu$ is the learning rate of the LMS updates. One can use different adaptive
methods instead of the LMS update, such as the RLS or NLMS updates [24], by only changing the filter
update steps in Fig. 4. The algorithm of Fig. 4 has access to training data of size $T$. After the training
data is used, the adaptive methods work in decision directed mode [24]. Since there are no a priori
probabilities in the first turbo iteration, this algorithm uses an LMS update to train a linear equalizer
with only the feedforward filter, i.e., $\hat{x}(t) = \mathbf{w}^T(t)\mathbf{y}(t)$, without any regions or mean vectors. Note that
an adaptive feedforward linear filter $\mathbf{w}(t)$ trained on only $\{\mathbf{y}(t)\}$ without a priori probabilities (as in the
first iteration) converges to [11] (assuming zero variance in convergence)

$$\lim_{t \to \infty} \mathbf{w}(t) = \mathbf{w}_o \triangleq [\mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}]^T,$$

which is the linear MMSE feedforward filter in (2) with $\mathbf{Q}(t) = \mathbf{I}$.

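In the pseudo-code in Fig. 4, the iteration numbers are displayed as superscripts, e.g., $\mathbf{w}_i^{(m)}(t)$, $\mathbf{f}_i^{(m)}(t)$
are the feedforward and feedback filters for the $m$th iteration corresponding to the $i$th region, respectively.
After the first iteration when $\{\mathbf{q}(t)\}$ become available, we apply the VQ algorithm to get the corresponding
regions and the centroids. Then, for each region, we run a separate LMS update to train a linear equalizer
and construct the estimated data as in (6). In the start of the second iteration, in line A, each feedforward
filter is initialized by the feedforward filter trained in the first iteration. Furthermore, although the linear
equalizers should have the form $\mathbf{w}_k^T(t)\mathbf{y}(t) - \mathbf{f}_k^T(t)\bar{\mathbf{x}}(t)$, since we have the correct $\mathbf{x}(t)$ in the training
mode for $t = 1, \ldots, T$, the algorithms are trained using $\mathbf{w}_k^T(t)\mathbf{y}(t) - \mathbf{f}_k^T(t)[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k)]^{1/2}\mathbf{x}(t)$ in (line
B), i.e., $\mathbf{x}(t)$ is scaled using $[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k)]^{1/2}$, to incorporate the uncertainty during training [25]. After
the second iteration, in the start of each iteration, in line C, the linear equalizers in each region, say
$k$, are initialized using the filters trained in the previous iteration that are closest to the $k$th region, i.e.,
$j = \arg\min_i \|\tilde{\mathbf{q}}_k^{(m)} - \tilde{\mathbf{q}}_i^{(m-1)}\|$, $\mathbf{w}_k^{(m)}(0) = \mathbf{w}_j^{(m-1)}(n)$, and $\mathbf{f}_k^{(m)}(0) = \mathbf{f}_j^{(m-1)}(n)$.

A Pseudo-code of Piecewise Linear Turbo Equalizer:

% 1st iteration:
for $t = 1, \ldots, n$:
    $\hat{x}(t) = \mathbf{w}^{(1)T}(t)\mathbf{y}(t)$,
    if $t \le T$: $e(t) = x(t) - \hat{x}(t)$, elseif $t > T$: $e(t) = Q(\hat{x}(t)) - \hat{x}(t)$, $Q(\cdot)$ is a quantizer.
    $\mathbf{w}^{(1)}(t+1) = \mathbf{w}^{(1)}(t) + \mu e(t)\mathbf{y}(t)$.
calculate $q(t)$ using the SISO decoder for $t > T$.
% 2nd iteration:
apply LBG VQ algorithm to $\{\mathbf{q}(t)\}_{t>T}$ to generate $\tilde{\mathbf{q}}_k^{(2)}$, $k = 1, \ldots, K$.
for $k = 1, \ldots, K$: $\mathbf{w}_k^{(2)}(0) = \mathbf{w}^{(1)}(n)$, where $\mathbf{w}^{(1)}(n)$ is from 1st iteration. (line A)
for $t = 1, \ldots, T$:
    for $k = 1, \ldots, K$:
        $e_k(t) = x(t) - \mathbf{w}_k^{(2)T}(t)\mathbf{y}(t) - \mathbf{f}_k^{(2)T}(t)[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k^{(2)})]^{1/2}\mathbf{x}(t)$, (line B)
        $\mathbf{w}_k^{(2)}(t+1) = \mathbf{w}_k^{(2)}(t) + \mu e_k(t)\mathbf{y}(t)$, $\mathbf{f}_k^{(2)}(t+1) = \mathbf{f}_k^{(2)}(t) + \mu e_k(t)[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k^{(2)})]^{1/2}\mathbf{x}(t)$.
for $t = T+1, \ldots, n$:
    $i = \arg\min_k \|\mathbf{q}(t) - \tilde{\mathbf{q}}_k^{(2)}\|$,
    $\hat{x}(t) = \mathbf{w}_i^{(2)T}(t)\mathbf{y}(t) - \mathbf{f}_i^{(2)T}(t)\bar{\mathbf{x}}(t)$,
    $e(t) = Q(\hat{x}(t)) - \hat{x}(t)$,
    $\mathbf{w}_i^{(2)}(t+1) = \mathbf{w}_i^{(2)}(t) + \mu e(t)\mathbf{y}(t)$, $\mathbf{f}_i^{(2)}(t+1) = \mathbf{f}_i^{(2)}(t) + \mu e(t)\bar{\mathbf{x}}(t)$.
calculate $\{\mathbf{q}(t)\}_{t>T}$ using the SISO decoder.
% mth iteration:
apply LBG VQ algorithm to $\{\mathbf{q}(t)\}_{t>T}$ to generate $\tilde{\mathbf{q}}_k^{(m)}$, $k = 1, \ldots, K$.
for $k = 1, \ldots, K$: $\mathbf{w}_k^{(m)}(0) = \mathbf{w}_j^{(m-1)}(n)$, $\mathbf{f}_k^{(m)}(0) = \mathbf{f}_j^{(m-1)}(n)$, where $j = \arg\min_i \|\tilde{\mathbf{q}}_k^{(m)} - \tilde{\mathbf{q}}_i^{(m-1)}\|$. (line C)
for $t = 1, \ldots, T$:
    for $k = 1, \ldots, K$:
        $e_k(t) = x(t) - \mathbf{w}_k^{(m)T}(t)\mathbf{y}(t) - \mathbf{f}_k^{(m)T}(t)[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k^{(m)})]^{1/2}\mathbf{x}(t)$,
        $\mathbf{w}_k^{(m)}(t+1) = \mathbf{w}_k^{(m)}(t) + \mu e_k(t)\mathbf{y}(t)$, $\mathbf{f}_k^{(m)}(t+1) = \mathbf{f}_k^{(m)}(t) + \mu e_k(t)[\mathbf{I} - \mathrm{diag}(\tilde{\mathbf{q}}_k^{(m)})]^{1/2}\mathbf{x}(t)$.
for $t = T+1, \ldots, n$:
    $i = \arg\min_k \|\mathbf{q}(t) - \tilde{\mathbf{q}}_k^{(m)}\|$,
    $\hat{x}(t) = \mathbf{w}_i^{(m)T}(t)\mathbf{y}(t) - \mathbf{f}_i^{(m)T}(t)\bar{\mathbf{x}}(t)$,
    $e(t) = Q(\hat{x}(t)) - \hat{x}(t)$,
    $\mathbf{w}_i^{(m)}(t+1) = \mathbf{w}_i^{(m)}(t) + \mu e(t)\mathbf{y}(t)$, $\mathbf{f}_i^{(m)}(t+1) = \mathbf{f}_i^{(m)}(t) + \mu e(t)\bar{\mathbf{x}}(t)$.
calculate $\{\mathbf{q}(t)\}_{t>T}$ using the SISO decoder.

Fig. 4. A piecewise linear equalizer for turbo equalization. This algorithm requires $O(M + N)$ computations.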

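Assuming large $K$ with dense regions, we have $\mathbf{q}(t) \approx \tilde{\mathbf{q}}_k$ when $\mathbf{q}(t) \in V_k$. To get the vectors to which the
LMS trained linear filters in region $k$ eventually converge, i.e., the linear MMSE estimators assuming
stationary $\bar{\mathbf{x}}$, we need to calculate $E[(\mathbf{x}(t) - \bar{\mathbf{x}}(t))(\mathbf{x}(t) - \bar{\mathbf{x}}(t))^H : \mathbf{q}(t) = \tilde{\mathbf{q}}_k]$, which is assumed to be
diagonal due to the interleaving [11], yielding

$$E\{[\mathbf{x}(t) - \bar{\mathbf{x}}(t)][\mathbf{x}(t) - \bar{\mathbf{x}}(t)]^H : \mathbf{q}(t) = \tilde{\mathbf{q}}_k\}$$
$$= E\big\{E\{[\mathbf{x}(t) - \bar{\mathbf{x}}(t)][\mathbf{x}(t) - \bar{\mathbf{x}}(t)]^H : \{LLR_a^E(t)\}, \mathbf{q}(t) = \tilde{\mathbf{q}}_k\} : \mathbf{q}(t) = \tilde{\mathbf{q}}_k\big\}$$
$$= \mathrm{diag}([\tilde{q}_k(1), \ldots, \tilde{q}_k(M + N_2 - 1), 1, \tilde{q}_k(M + N_2), \ldots, \tilde{q}_k(M + N - 2)])$$

due to the definition of $\mathbf{q}(t)$ and assuming a stationary distribution on $\bar{\mathbf{x}}(t)$. This yields that the linear
filters in region $k$ converge to

$$\lim_{t \to \infty} \mathbf{w}_k(t) = \mathbf{w}_{k,o} \triangleq [\mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r\tilde{\mathbf{Q}}_k\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}]^T,$$
$$\lim_{t \to \infty} \mathbf{f}_k(t) = \mathbf{f}_{k,o} \triangleq \mathbf{H}^T\mathbf{w}_{k,o}, \tag{9}$$

where $\tilde{\mathbf{Q}}_k \triangleq \mathrm{diag}(\tilde{\mathbf{q}}_k)$, assuming zero variance at convergence. Hence, at each time $t$, assuming
convergence, the difference between the MSE of the equalizer in (9) and the MSE of the linear MMSE
equalizer in (2) is given by

$$\big\|\mathbf{w}_{k,o}^T\mathbf{H}_r\mathbf{Q}(t)\mathbf{H}_r^H\mathbf{w}_{k,o}^* + \sigma_n^2\mathbf{w}_{k,o}^T\mathbf{w}_{k,o}^* - [1 - \mathbf{v}^H(\sigma_n^2\mathbf{I} + \mathbf{H}_r\mathbf{Q}(t)\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H)^{-1}\mathbf{v}]\big\| \le O(\|\mathbf{q}(t) - \tilde{\mathbf{q}}_k\|), \tag{10}$$

as shown in Appendix A. Due to (10), as the number of piecewise linear regions, i.e., $K$, increases and
$\|\mathbf{q}(t) - \tilde{\mathbf{q}}_k\|$ approaches 0, the MSE of the converged adaptive filter more accurately approximates the
MSE of the linear MMSE equalizer.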
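A short derivation may help in reading (9) and (10). It is only a sketch under the signal model of Section II, assuming unit-power symbols and a zero prior mean for the current symbol; the symbols $J$ and $\mathbf{A}(t)$ are introduced here for illustration, $\mathbf{w}(t)$ denotes the filter of (2), and the argument is not a substitute for Appendix A.

```latex
\begin{align*}
\mathbf{A}(t) &\triangleq \sigma_n^2\mathbf{I} + \mathbf{H}_r\mathbf{Q}(t)\mathbf{H}_r^H + \mathbf{v}\mathbf{v}^H, \\
J(\mathbf{w}) &\triangleq E\big[\,|x(t) - \mathbf{w}^T(\mathbf{y}(t) - \mathbf{H}\bar{\mathbf{x}}(t))|^2\,\big]
  = 1 - \mathbf{w}^T\mathbf{v} - (\mathbf{w}^T\mathbf{v})^* + \mathbf{w}^T\mathbf{A}(t)\mathbf{w}^*, \\
\mathbf{w}^T(t) &= \mathbf{v}^H\mathbf{A}^{-1}(t)
  \quad\Longrightarrow\quad J(\mathbf{w}(t)) = 1 - \mathbf{v}^H\mathbf{A}^{-1}(t)\mathbf{v}, \\
J(\mathbf{w}_{k,o}) - J(\mathbf{w}(t)) &= (\mathbf{w}_{k,o} - \mathbf{w}(t))^T\,\mathbf{A}(t)\,(\mathbf{w}_{k,o} - \mathbf{w}(t))^* \;\ge\; 0 .
\end{align*}
```

The last line follows by completing the square, using that $\mathbf{A}(t)$ is Hermitian and positive definite. Since $\mathbf{w}_{k,o}$ and $\mathbf{w}(t)$ differ only through $\tilde{\mathbf{Q}}_k$ versus $\mathbf{Q}(t)$, and the filter is a continuous function of the variances for $\sigma_n^2 > 0$, the excess MSE vanishes as $\mathbf{q}(t)$ approaches $\tilde{\mathbf{q}}_k$.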

In the algorithm of Fig. 4, the partition of the space of $q \in [0, 1]^{N+M-2}$ is fixed, i.e., the partitioned regions are fixed at the start of the equalization, after the VQ algorithm, and we sequentially learn a different linear equalizer for each region. Since the equalizers are learned sequentially from a limited amount of data, training problems may arise if there is not enough data in each region. In other words, although one can increase $K$ to increase the approximation power, the performance may deteriorate if there is not enough data to learn the corresponding linear models in each region. To alleviate this, one can start the learning with a piecewise linear model with a smaller $K$ and gradually increase $K$ to moderate values if enough data is available. In the next section, we examine the context tree weighting method, which intrinsically performs such weighting among different models based on their performance, hence allowing the boundaries of the partition regions to be design parameters.
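
The fixed-partition scheme described above can be sketched as follows. This is a hedged illustration only: the class name `PiecewiseLMS`, its method names and the per-region sample counter are assumptions, and the feedback filter is updated with the gradient-descent sign implied by the estimate $w^T y - f^T \bar{x}$.

```python
import numpy as np

class PiecewiseLMS:
    """One LMS equalizer per region; regions are nearest-centroid cells of q."""

    def __init__(self, centroids, n_ff, n_fb, mu=0.001):
        self.centroids = np.asarray(centroids, dtype=float)   # K x dim(q)
        K = len(self.centroids)
        self.w = np.zeros((K, n_ff))      # feedforward filters, one row per region
        self.f = np.zeros((K, n_fb))      # feedback filters, one row per region
        self.mu = mu
        self.counts = np.zeros(K, dtype=int)   # samples seen by each region

    def region(self, q):
        """Index of the centroid closest to q."""
        return int(np.argmin(np.linalg.norm(self.centroids - q, axis=1)))

    def estimate(self, y, x_bar, q):
        k = self.region(q)
        return k, self.w[k] @ y - self.f[k] @ x_bar

    def update(self, y, x_bar, q, target):
        """LMS step on the filters of the region that q falls into."""
        k, x_hat = self.estimate(y, x_bar, q)
        e = target - x_hat
        self.w[k] += self.mu * e * y
        self.f[k] -= self.mu * e * x_bar   # descent direction for the -f^T x_bar term
        self.counts[k] += 1
        return e
```

The counter makes the training issue visible: with a large number of regions, some rows of `w` and `f` may receive very few updates.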

[Figure: binary tree whose nodes show the corresponding shaded regions of the unit square with axes $q(t-1)$ and $q(t+1)$; the depth-1 nodes are the left child and the right child.]

Fig. 5. A full binary context tree with depth $D = 2$ and 4 leaves. The leaves of this binary tree partition $[0, 1]^2$, i.e., $[q(t-1)\ q(t+1)]^T \in [0, 1]^2$, into 4 disjoint regions.



B. Piecewise Linear Turbo Equalization Using Context Trees 

We first introduce a binary context tree to partition $[0, 1]^{N+M-2}$ into disjoint regions in order to construct a piecewise linear equalizer that can choose both the regions and the equalizer coefficients in these regions based on the equalization performance. On a context tree, starting from the root node, we have a left hand child and a right hand child. Each left hand child and right hand child have their own left hand and right hand children. This splitting yields a binary tree of depth $D$ with a total of $2^D$ leaves at depth $D$ and $2^{D+1} - 1$ nodes. As an example, the context tree with $D = 2$ in Fig. 5 partitions $[0, 1]^2$, i.e., $q = [q(t-1), q(t+1)]^T \in [0, 1]^2$, into 4 disjoint regions. Each one of these 4 disjoint regions is assigned to a leaf on this binary tree. Then, recursively, each internal node on this tree represents a region (shaded areas in Fig. 5), which is the union of the regions assigned to its children.

On a binary context tree of depth $D$, one can define a doubly exponential number, $m \approx (1.5)^{2^D}$, of "complete" subtrees as in Fig. 6. A complete subtree is constructed from a subset of the nodes of the original tree, starting from the same root node, and the union of the regions assigned to its leaves yields $[0, 1]^{N+M-2}$. For example, for a subtree $i$, if the regions assigned to its leaves are labeled as $V_{1,i}, \ldots, V_{K_i,i}$, where $K_i$ is the number of leaves of the subtree $i$, then $[0, 1]^{N+M-2} = \bigcup_{k=1}^{K_i} V_{k,i}$. Each $V_{k,i}$ of the subtree corresponds to a node in the original tree.
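
The number of complete subtrees follows a simple recursion: a complete subtree rooted at a node either stops at that node or splits into two independent complete subtrees, so $m_d = m_{d-1}^2 + 1$ with $m_0 = 1$. The sketch below evaluates this recursion and enumerates the partitions for small depths. The function names and the heap-style node numbering (root 1, children $2k$ and $2k + 1$) are illustrative assumptions, not notation from the paper.

```python
def count_partitions(depth):
    """Number of complete subtrees of a full binary tree of the given depth."""
    m = 1
    for _ in range(depth):
        m = m * m + 1
    return m

def enumerate_partitions(node, depth):
    """Every complete subtree below `node`, each given as a tuple of its leaf nodes."""
    if depth == 0:
        return [(node,)]
    left = enumerate_partitions(2 * node, depth - 1)
    right = enumerate_partitions(2 * node + 1, depth - 1)
    # Either keep `node` as a single region, or combine any left/right refinement.
    return [(node,)] + [a + b for a in left for b in right]

for D in range(5):
    print(D, count_partitions(D), round(1.5 ** (2 ** D), 1))
```

For $D = 2$ the enumeration returns the 5 partitions of Fig. 6, and the counts 26 and 677 for $D = 3$ and $D = 4$ are close to $(1.5)^{2^D}$.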

Fig. 6. All partitions of $[0, 1]^2$ using a binary context tree with $D = 2$. Given any partition, the union of the regions represented by the leaves of each partition is equal to $[0, 1]^2$.

With this definition, a complete subtree with the regions assigned to its leaves defines a complete "partition" of $[0, 1]^{N+M-2}$. Continuing with our example, on a binary tree of depth $D = 2$, we have 5
different partitions of $[0, 1]^2$, as shown in Fig. 6, labeled as $\Gamma_1, \ldots, \Gamma_5$. As in the previous section, we partition regions with the LBG VQ algorithm and then construct our context tree over these regions as follows. Suppose the LBG VQ algorithm is applied to $\{q(t)\}$ with $K = 2^D$ to generate $2^D$ regions [23]. The LBG VQ algorithm uses a tree notion similar to the context tree introduced in Fig. 6: the algorithm starts from a root node, which calculates the mean of all the vectors in $\{q(t)\}$ as the root codeword, and binary splits the data as well as the root codeword into two segments. Then, these newly constructed codewords are iteratively used as the initial codebook of the split segments. These two codewords are then split into four, and the process is repeated until the desired number of regions is reached. At the end, this binary splitting and clustering yield $2^D$ regions with the corresponding centroids $\tilde{q}_i$, $i = 1, \ldots, 2^D$, which are assigned to the leaves of the context tree. Note that since each pair of leaves (or nodes) comes from a parent node after a binary splitting, these parent codewords are stored as the internal nodes of the context tree, i.e., the nodes that are generated by splitting a parent node are considered the children of this parent node, where the centroid before splitting is stored. Hence, in this sense, the LBG VQ algorithm intrinsically constructs the context tree. However, note that, at each turn, even though the initial centroids at each splitting directly come from the parent node in the original LBG VQ algorithm, the final regions of the leaf nodes, obtained while minimizing distortion by iterating (7) and (8), may invade the regions of the other parent nodes, i.e., the union of the regions assigned to the children of the split parent node can be different from the region assigned to the parent node. Note that one can modify the LBG algorithm with the constraint that the regions of the children nodes should be optimized within the regions of their parent nodes. However, this constraint may deteriorate the performance due to more quantization error.
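
The binary splitting described above can be sketched as a recursive two-way clustering that stores each codeword before it is split. Note that this sketch implements the constrained variant mentioned in the text, in which children are optimized only over the data of their parent, rather than the original LBG procedure. The function names, the perturbation size `eps`, the iteration count and the heap-style node numbering are assumptions made for illustration.

```python
import numpy as np

def two_means(points, c, eps=1e-3, iters=20):
    """Split codeword c into two by perturbation, then refine with Lloyd iterations."""
    c0, c1 = c * (1 + eps) + eps, c * (1 - eps) - eps
    for _ in range(iters):
        to_c1 = np.linalg.norm(points - c1, axis=1) < np.linalg.norm(points - c0, axis=1)
        if to_c1.all() or not to_c1.any():
            break                      # degenerate split: keep the current codewords
        c0, c1 = points[~to_c1].mean(axis=0), points[to_c1].mean(axis=0)
    return (c0, points[~to_c1]), (c1, points[to_c1])

def build_tree(points, depth):
    """Return {node: centroid} with heap indexing (root 1, children 2k and 2k + 1)."""
    tree = {}

    def grow(node, pts, c, d):
        tree[node] = c                 # codeword stored before any further split
        if d == 0:
            return
        (c0, p0), (c1, p1) = two_means(pts, c)
        grow(2 * node, p0, c0, d - 1)
        grow(2 * node + 1, p1, c1, d - 1)

    grow(1, points, points.mean(axis=0), depth)
    return tree
```

With this indexing the leaves are the nodes $2^D, \ldots, 2^{D+1} - 1$, and the internal nodes hold the parent codewords.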

Given such a context tree, one can define $m \approx (1.5)^{2^D}$ different partitions of the space spanned by $q$ and construct a piecewise linear equalizer as in Fig. 4 for each such partition. For each such partition, one can train and use a piecewise linear model. Note that all these piecewise linear equalizers are constructed using subsets of nodes $\rho \in \{1, \ldots, 2^{D+1} - 1\}$. Hence, suppose we number each node on this context tree $\rho = 1, \ldots, 2^{D+1} - 1$ and assign a linear equalizer to each node as $\hat{x}_\rho(t) = w_\rho^T(t) y(t) - f_\rho^T(t) \bar{x}(t)$. The linear models $w_\rho$, $f_\rho$ that are assigned to node $\rho$ train only on the data assigned to that node as in Fig. 4, i.e., if $q(t)$ is in the region $V_\rho$ that is assigned to the node $\rho$, then $w_\rho$ and $f_\rho$ are updated. Then, the piecewise linear equalizer corresponding to the partition $\Gamma_i = \{V_{1,i}, \ldots, V_{K_i,i}\}$ (where $\bigcup_{k=1}^{K_i} V_{k,i} = [0, 1]^{N+M-2}$), say $\hat{x}_{\Gamma_i}(t)$, is defined such that if $q(t) \in V_{k,i}$ and $\rho$ is the node that is assigned to $V_{k,i}$, then

$$\hat{x}_{\Gamma_i}(t) = \hat{x}_\rho(t) = w_\rho^T(t) y(t) - f_\rho^T(t) \bar{x}(t). \tag{11}$$

One of these partitions, with the given piecewise adaptive linear model $\hat{x}_{\Gamma_i}(t)$, achieves the minimal loss, e.g., the minimal accumulated squared error $\sum_{t=1}^{n} (x(t) - \hat{x}_{\Gamma_i}(t))^2$, for some $n$. However, the best piecewise model with the best partition is not known a priori. We next introduce an algorithm that achieves the performance of the best partition with the best linear model that achieves the minimal accumulated square-error with complexity only linear in the depth of the context tree per sample, i.e., complexity $O(D(2N + M))$ instead of $O((1.5)^{2^D} D(2N + M))$, where $D$ is the depth of the tree.



Remark 2: We emphasize that the partitioned model that corresponds to the union of the leaves, i.e., the finest partition, has the highest number of regions and parameters to model the nonlinear dependency. However, note that the finest partition needs to train a separate linear equalizer in each of its regions. As a result, the piecewise linear equalizer with the finest partition may not yield satisfactory results in the beginning of the adaptation if there is not enough data to train all the model parameters. In this sense, the context tree algorithm adaptively weights coarser and finer models based on their performance.

To accomplish this, we introduce the algorithm in Fig. 7, i.e., $\hat{x}_{ctw}(t)$, that is constructed using the context tree weighting method introduced in [14]. The context tree based equalization algorithm implicitly constructs all $\hat{x}_{\Gamma_i}(t)$, $i = 1, \ldots, m$, piecewise linear equalizers and acts as if it had actually run all these equalizers in parallel on the received data. At each time $t$, the final estimate $\hat{x}_{ctw}(t)$ is constructed as a weighted combination of all the outputs $\hat{x}_{\Gamma_i}(t)$ of these piecewise linear equalizers, where the combination weights are calculated proportional to the performance of each equalizer $\hat{x}_{\Gamma_i}(t)$ on the past data. However, as shown in (11), although there are $m$ different piecewise linear algorithms, at each time $t$, each $\hat{x}_{\Gamma_i}(t)$ is equal to one of the $D + 1$ node estimates of the nodes to which $q(t)$ belongs, e.g., if $q(t)$ belongs to the left-left hand child in Fig. 5, $\hat{x}_{\Gamma_i}(t)$, $i = 1, \ldots, m$, use the node estimate $\hat{x}_\rho(t)$ that belongs to either the left-left hand child, the left hand child or the root. How the context tree algorithm keeps track of these $m$ piecewise linear models as well as their performance-based combination weights with computational complexity only linear in the depth of the context tree is explained in Appendix B.

For the context tree algorithm, since there are no a priori probabilities in the first iteration, the first iteration of Fig. 7 is the same as the first iteration of Fig. 4. After the first iteration, to incorporate the uncertainty during training as in Fig. 4, the context tree algorithm is run by using weighted training data [25]. At each time $t > T$, $\hat{x}_{ctw}(t)$ constructs its nonlinear estimate of $x(t)$ as follows. We first find the regions to which $q(t)$ belongs. Due to the tree structure, one needs only to find the leaf node in which $q(t)$ lies and collect all the parent nodes towards the root node. The nodes to which $q(t)$ belongs are stored in $\mathbf{l}$ in Fig. 7. The final estimate $\hat{x}_{ctw}(t)$ is constructed as a weighted combination of the estimates generated in these nodes, i.e., $\hat{x}_\rho(t)$, $\rho \in \mathbf{l}$, where the weights are functions of the performance of the node estimates on previous samples. At each time $t$, $\hat{x}_{ctw}(t)$ requires $O(\ln(D))$ calculations to find the leaf to which $q(t)$ belongs. Then, $D + 1$ node estimates, $\hat{x}_\rho(t)$, $\rho \in \mathbf{l}$, are calculated and the filters at these nodes are updated with $O(2N + M)$ computations each. The final weighted combination is produced with $O(D)$ computations. Hence, at each time the context tree algorithm requires $O(D(2N + M))$ computational complexity. For this algorithm, we have the following result.

Theorem 2: Let $\{x(t)\}$, $\{n(t)\}$ and $\{y(t)\}$ represent the transmitted, noise and received signals, and let $\{q(t)\}$ represent the sequence of variances constructed using the a priori probabilities for each constellation point produced by the SISO decoder. Let $\hat{x}_\rho(t)$, $\rho = 1, \ldots, 2^{D+1} - 1$, be estimates of $x(t)$ produced by the equalizers assigned to each node on the context tree. The algorithm $\hat{x}_{ctw}(t)$, when applied to $\{y(t)\}$, for all $n$ achieves

$$\sum_{t=1}^{n} (x(t) - \hat{x}_{ctw}(t))^2 \le \min_{i} \left\{ \sum_{t=1}^{n} [x(t) - \hat{x}_{\Gamma_i}(t)]^2 + 2K_i - 1 \right\}, \tag{12}$$

for all $i$, $i = 1, \ldots, m \approx (1.5)^{2^D}$, assuming perfect feedback in decision directed mode, i.e., $Q(\hat{x}(t)) = x(t)$ when $t > T$, where $\hat{x}_{\Gamma_i}(t)$ is the equalizer constructed as

$$\hat{x}_{\Gamma_i}(t) = \hat{x}_\rho(t),$$

$\rho$ is the node assigned to the volume in $\Gamma_i = \{V_{1,i}, \ldots, V_{K_i,i}\}$ to which $q(t)$ belongs, and $K_i$ is the number of regions in $\Gamma_i$. If the estimation algorithms assigned to each node are selected as adaptive linear equalizers such as an RLS update based algorithm, (12) yields

$$\sum_{t=1}^{n} [x(t) - \hat{x}_{ctw}(t)]^2 \le \min_{i} \left\{ \min_{w_{k,i}, f_{k,i}} \sum_{t=1}^{n} [x(t) - (w_{s_i(t),i}^T y(t) - f_{s_i(t),i}^T \bar{x}(t))]^2 + O((2N + M)\ln(n)) + 2K_i - 1 \right\}, \tag{13}$$

where $s_i(t)$ is an indicator variable for $\Gamma_i$ such that if $q(t) \in V_{k,i}$, then $s_i(t) = k$.
An outline of the proof of this theorem is given in Appendix B.
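
The per-sample weighting over the $D + 1$ nodes on the path of $q(t)$ can be sketched as follows. This is a hedged illustration of a standard context tree weighting recursion, not a transcription of Fig. 7: the class name, the heap-style node numbering (root 1, children $2k$ and $2k + 1$) and the value of the constant `c` are assumptions. In this sketch a leaf carries prior weight 1 and an internal node prior weight 1/2, which makes the combination weights sum to one.

```python
import numpy as np

class ContextTreeWeights:
    """Performance-based weights for the nodes on a root-to-leaf path."""

    def __init__(self, depth, c=0.5):
        self.D = depth
        n_nodes = 2 ** (depth + 1)          # index 0 unused; the root is node 1
        self.A = np.ones(n_nodes)           # weighted performance of each subtree
        self.B = np.ones(n_nodes)           # performance of each node's own estimator
        self.c = c

    def path(self, leaf):
        """Nodes from the root down to leaf number `leaf` (0 .. 2^D - 1)."""
        node = 2 ** self.D + leaf
        return [node >> s for s in range(self.D, -1, -1)]

    def weights(self, path):
        """Combination weight of every node estimate on the path."""
        beta = np.empty(len(path))
        prefix = 1.0                         # product of 1/2 * A_sibling over ancestors
        for k, node in enumerate(path):
            is_leaf = (k == self.D)
            beta[k] = prefix * (1.0 if is_leaf else 0.5) * self.B[node]
            if not is_leaf:
                sibling = path[k + 1] ^ 1    # heap siblings differ in the lowest bit
                prefix *= 0.5 * self.A[sibling]
        return beta / self.A[1]

    def update(self, path, errors):
        """Update B on the path, then recompute A from the leaf up to the root."""
        for node, e in zip(path, errors):
            self.B[node] *= np.exp(-self.c * e ** 2)
        for k in range(self.D, -1, -1):
            node = path[k]
            if k == self.D:
                self.A[node] = self.B[node]
            else:
                self.A[node] = 0.5 * self.B[node] + 0.5 * self.A[2 * node] * self.A[2 * node + 1]
```

Given the node estimates on the path, the combined estimate is then `weights(path) @ node_estimates`. A practical implementation would need to guard against numerical underflow of `A` and `B` over long data records, which this sketch does not do.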

Remark 3: We observe from (12) that the context tree algorithm achieves the performance of the best sequential algorithm among a doubly exponential number of possible algorithms. Note that the bound in (12) holds uniformly for all $i$; however, the bound is the largest for the finest partition corresponding to all leaves. We observe from (13) that the context tree algorithm also achieves the performance of even the best piecewise linear model, independently optimized in each region, for all $i$ when the node estimators in each region are adaptive algorithms that achieve the minimum least square-error.

1) MSE Performance of the Context Tree Equalizer: To get the MSE performance of the context tree equalizer, we observe that the result (13) in the theorem is uniformly true for any sequence $\{x(t)\}$. Hence, taking the expectation of both sides of (13) with respect to any distribution on $\{x(t)\}$ yields the following corollary:

Corollary:

$$\sum_{t=1}^{n} E\{[x(t) - \hat{x}_{ctw}(t)]^2\} \le \min_{i} \left\{ \min_{w_{k,i}, f_{k,i}} \sum_{t=1}^{n} E\{[x(t) - (w_{s_i(t),i}^T y(t) - f_{s_i(t),i}^T \bar{x}(t))]^2\} + O((2N + M)\ln(n)) + 2K_i - 1 \right\}. \tag{14}$$

Note that (13) is true for all $i$ and, given $i$, for any $w_{k,i}$, $f_{k,i}$, $k = 1, \ldots, K_i$, i.e.,

$$\sum_{t=1}^{n} [x(t) - \hat{x}_{ctw}(t)]^2 \le \sum_{t=1}^{n} [x(t) - (w_{s_i(t),i}^T y(t) - f_{s_i(t),i}^T \bar{x}(t))]^2 + O((2N + M)\ln(n)) + 2K_i - 1, \tag{15}$$

since (13) is true for the minimizing $i$ and equalizer vectors. Taking the expectation of both sides of (15) and minimizing with respect to $i$ and $w_{k,i}$, $f_{k,i}$, $k = 1, \ldots, K_i$, yields the corollary.

We emphasize that the minimizer vectors $w_{k,i}$ and $f_{k,i}$ at the right hand side of (14) minimize the sum of all the MSEs. Hence, the corollary does not relate the MSE performance of the CTW equalizer to the MSE performance of the linear MMSE equalizer given in (2). However, if we assume that the adaptive filters trained at each node converge to their optimal coefficient vectors with zero variance, then for sufficiently large $D$ and $n$, we have for piecewise linear models such as the finest partition

$$\sum_{t=1}^{n} (x(t) - \hat{x}_{\Gamma_{|K|}}(t))^2 \approx \sum_{t=1}^{n} [x(t) - (w_{s_{|K|}(t),|K|,o}^T y(t) - f_{s_{|K|}(t),|K|,o}^T \bar{x}(t))]^2, \tag{16}$$

where we assumed, for notational simplicity, that the $|K|$th partition is the finest partition, and $w_{s_{|K|}(t),|K|,o}$ and $f_{s_{|K|}(t),|K|,o}$ are the MSE optimal filters (if defined) corresponding to the regions assigned to the leaves of the context tree. Note that we require $D$ to be large such that we can assume $q(t)$ to be constant in each region such that these MSE optimal filters are well-defined. Since (13) is correct for all partitions and for the minimizing $w$, $f$ vectors, (13) holds for any $w$ and $f$ pair, including the $w_{s_{|K|}(t),|K|,o}$ and $f_{s_{|K|}(t),|K|,o}$ pair. Since (13) in the theorem is uniformly true, taking the expectation preserves the bound
and using (16), we have

$$\frac{1}{n} \sum_{t=1}^{n} E\{[x(t) - \hat{x}_{ctw}(t)]^2\} \le \frac{1}{n} \sum_{t=1}^{n} E\{[x(t) - (w_{s_{|K|}(t),|K|,o}^T y(t) - f_{s_{|K|}(t),|K|,o}^T \bar{x}(t))]^2\} + O\left(\frac{2^{D+1}}{n}\right), \tag{17}$$

since for the finest partition $K_{|K|} = 2^D$. Using the MSE definition for each node in (17) yields

$$\frac{1}{n} \sum_{t=1}^{n} E\{[x(t) - \hat{x}_{ctw}(t)]^2\} \le \frac{1}{n} \sum_{t=1}^{n} \left\{ w_{s_{|K|}(t),|K|,o}^T H_r Q(t) H_r^H w_{s_{|K|}(t),|K|,o}^* + \sigma_n^2 w_{s_{|K|}(t),|K|,o}^T w_{s_{|K|}(t),|K|,o}^* \right\} + O\left(\frac{2^{D+1}}{n}\right) \tag{18}$$

$$\le \frac{1}{n} \sum_{t=1}^{n} \left[ \min_{w, f} E\{[x(t) - (w^T y(t) - f^T \bar{x}(t))]^2 : q(t)\} + O\left(\frac{1}{2^D}\right) \right] + O\left(\frac{2^{D+1}}{n}\right), \tag{19}$$

where (19) follows since, assuming large $D$, the MSE in each node is bounded as in (10). Note that $O(\|q(t) - \tilde{q}_k\|)$ at the right hand side of (10) can further be upper bounded by $O(1/2^D)$ assuming large enough $D$ with the partition given in Fig. 5, since we have $2^D$ regions and $\|q(t)\| \le \sqrt{N + M - 2}$. Hence, as $D \to \infty$, the context tree algorithm asymptotically achieves the performance of the linear MMSE equalizer, i.e., the equalizer that is nonlinear in the variances of the soft information.

IV. Numerical Examples 

In the following, we simulate the performance of the algorithms introduced in this paper under different scenarios. A rate one half convolutional code with constraint length 3 and random interleaving is used. In the first set of experiments, we use the time invariant channel from [7] (Chapter 10)

$$h = [0.227, 0.46, 0.688, 0.46, 0.227]^T$$

with the training size $T = 1024$ and data length 5120 (excluding training). The BER and MSE curves are calculated over 20 independent trials. The decision directed mode is used for all the LMS algorithms, e.g., for the ordinary LMS turbo equalizer we compete against and for all the node filters on the context tree. Our calculation of the extrinsic LLR at the output of the ordinary LMS algorithm is based on [9]. For all LMS filters, we use $N_1 = 9$, $N_2 = 5$ and a feedback filter of length $N + M - 1 = 19$. The learning rates for the LMS algorithms are set to $\mu = 0.001$. This learning rate is selected to guarantee the convergence of the ordinary LMS filter in the training part. The same learning rate is used directly on the context tree without tuning. In Fig. 8a, we demonstrate the time evolution of the weight vector for the ordinary LMS turbo equalization algorithm in the first turbo iteration. We also plot the convolution of the channel $h$ and the converged weight vector of the LMS algorithm at the end of the first iteration in Fig. 8b. In Fig. 9, we plot BERs for an ordinary LMS algorithm, a context tree equalization algorithm with $D = 2$ given in Fig. 7 and the piecewise linear equalization algorithm with the finest partition, i.e., $\hat{x}_{\Gamma_{|K|}}(t)$, on the same tree. Note that the piecewise linear equalizer with the finest partition, i.e., $\Gamma_5$ in Fig. 6, has the highest number of linear models, i.e., $2^D$ independent filters, for equalization. However, we emphasize that all the linear filters in the leaves should be sequentially trained for the finest partition. Hence, as explained in Section III-B, the piecewise linear model with the finest partition may yield inferior performance compared to the CTW algorithm that adaptively weights all the models based on their performance. We observe that the context tree equalizer outperforms the ordinary LMS equalizer and the equalizer corresponding to the finest partition for these simulations. In Fig. 10, we plot the weight evolution of the context tree algorithm, i.e., the combined weight in line F of Fig. 7, to show the convergence of the CTW algorithm. Note that the combined weight vector for the CTW algorithm is only defined over the data length period 5120 at each turbo iteration, i.e., the combined weight vector is not defined in the training period. We collect the combined weight vectors for the CTW algorithm in the data period for all iterations and plot them in Fig. 10. This results in jumps in the figure, since at each discontinuity, i.e., after the data period, we switch to the training period and continue to train the node filters. The context tree algorithm, unlike the finest partition model, adaptively weights different partitions in each level. To see this, in Fig. 11a, we plot the weights assigned to each level in a depth $D = 2$ context tree. We also plot the time evolution of the performance measures $A_\rho(t)$ in Fig. 11b. We observe that the context tree algorithm, as expected, at the start of the equalization divides the weights fairly uniformly among the partitions or node equalizers. However, naturally, as the training size increases, when there is enough data to train all the node filters, the context tree algorithm favors models with better performance. Note that at each iteration, we reset the node probabilities $A_\rho(t) = 1$ since a new tree is constructed using clustering.

To see the effect of depth on the performance of the context tree equalizer, we plot, for the same channel, the BERs corresponding to context tree equalizers of depth $D = 1$, $D = 2$ and $D = 3$ in Fig. 12. We observe that as the depth of the tree increases, the performance of the tree equalizer improves for these depth ranges. However, note that the computational complexity of the CTW equalizer is directly proportional to the depth. As the last set of experiments, we perform the same set of experiments on a randomly generated channel of length 7 and plot the BERs in Fig. 13. We observe a similar improvement in BER for this randomly generated channel for these simulations.
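
As a rough illustration of the baseline in these experiments, the sketch below runs a single decision directed LMS equalizer (the first turbo iteration, without decoder feedback) over the channel $h$ with $T = 1024$, data length 5120, $\mu = 0.001$, $N_1 = 9$ and $N_2 = 5$. It is not the simulation code behind the figures: BPSK signaling, the noise variance `sigma2`, the random seed and the windowing of the received signal are assumptions, and no coding, interleaving or SISO decoding is included.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.227, 0.46, 0.688, 0.46, 0.227])   # channel from Section IV
T, n_data, mu = 1024, 5120, 0.001
N1, N2 = 9, 5
N = N1 + N2 + 1                                    # feedforward filter length
sigma2 = 0.1                                       # assumed noise variance

n = T + n_data
x = rng.choice([-1.0, 1.0], size=n)                # BPSK symbols (assumption)
r = np.convolve(x, h)[:n] + np.sqrt(sigma2) * rng.standard_normal(n)
r_pad = np.concatenate([np.zeros(N2), r, np.zeros(N1)])

w = np.zeros(N)
sq_err = np.empty(n)
for t in range(n):
    y = r_pad[t:t + N]                             # r[t - N2], ..., r[t + N1]
    x_hat = w @ y
    if t < T:
        target = x[t]                              # training mode
    else:
        target = 1.0 if x_hat >= 0 else -1.0       # decision directed mode
    w += mu * (target - x_hat) * y
    sq_err[t] = (x[t] - x_hat) ** 2

print("mean squared error over the second half of training:", sq_err[T // 2:T].mean())
```

Averaging `sq_err` over independent trials would give an MSE learning curve for this simplified setting.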

V. Conclusion 

In this paper, we introduced an adaptive nonlinear turbo equalization algorithm using context trees to model the nonlinear dependency of the linear MMSE equalizer on the soft information generated from the decoder. We use the CTW algorithm to partition the space of variances, which are time dependent and generated from the soft information. We demonstrate that the algorithm introduced asymptotically achieves the performance of the best piecewise linear model defined on this context tree with a computational complexity only of the order of an ordinary linear equalizer. We also demonstrate the convergence of the MSE of the CTW algorithm to the MSE of the linear minimum MSE estimator as the depth of the context tree and the data length increase.

Appendix

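A) To calculate the difference between the MSE of the equalizer in (9) and the MSE of the linear MMSE equalizer in (2), we start with

$$\| w_{k,o}^T H_r Q(t) H_r^H w_{k,o}^* + \sigma_n^2 w_{k,o}^T w_{k,o}^* - [1 - v^H(\sigma_n^2 I + H_r Q(t) H_r^H + vv^H)^{-1} v] \|$$

$$= \| w_{k,o}^T H_r \Delta Q_k H_r^H w_{k,o}^* + [1 - v^H(\sigma_n^2 I + H_r \tilde{Q}_k H_r^H + vv^H)^{-1} v] - [1 - v^H(\sigma_n^2 I + H_r Q(t) H_r^H + vv^H)^{-1} v] \|$$

$$= \| w_{k,o}^T H_r \Delta Q_k H_r^H w_{k,o}^* + v^H[(M + H_r \Delta Q_k H_r^H)^{-1} - M^{-1}] v \|, \tag{20}$$

where $\Delta Q_k = Q(t) - \tilde{Q}_k$ (the time index in $\Delta Q_k$ is omitted for presentation purposes) and $M = \sigma_n^2 I + H_r \tilde{Q}_k H_r^H + vv^H$. To simplify the second term in (20), we use the first order expansion from the Lemma in the last part of Appendix A to yield

$$v^H(M + H_r \Delta Q_k H_r^H)^{-1} v = v^H M^{-1} v + \mathrm{tr}\left\{ \left[ \nabla_{\Delta Q_k} v^H(M + H_r \Delta Q_k H_r^H)^{-1} v \big|_{\Delta Q_k = 0} \right]^H \Delta Q_k \right\} + O(\mathrm{tr}[\Delta Q_k^H \Delta Q_k]) \tag{21}$$

$$= v^H M^{-1} v - \mathrm{tr}(H_r^H M^{-1} vv^H M^{-1} H_r \Delta Q_k) + O(\mathrm{tr}[\Delta Q_k^H \Delta Q_k]), \tag{22}$$

around $\Delta Q_k = 0$. Hence, using (22) in (20) yields

$$\| w_{k,o}^T H_r \Delta Q_k H_r^H w_{k,o}^* + v^H[(M + H_r \Delta Q_k H_r^H)^{-1} - M^{-1}] v \|$$

$$= \| w_{k,o}^T H_r \Delta Q_k H_r^H w_{k,o}^* - \mathrm{tr}(H_r^H M^{-1} vv^H M^{-1} H_r \Delta Q_k) + O(\|q(t) - \tilde{q}_k\|^2) \| \le O(\|q(t) - \tilde{q}_k\|),$$

where the last line follows from the Schwarz inequality. □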
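Lemma: We have

$$\nabla_{\Delta Q_k} v^H [M + H_r \Delta Q_k H_r^H]^{-1} v = -H_r^H (M + H_r \Delta Q_k H_r^H)^{-1} vv^H (M + H_r \Delta Q_k H_r^H)^{-1} H_r. \tag{23}$$

Proof: To get the gradient of $(M + H_r \Delta Q_k H_r^H)^{-1}$ with respect to $\Delta Q_k$, we differentiate the identity $(M + H_r \Delta Q_k H_r^H)^{-1}(M + H_r \Delta Q_k H_r^H) = I$ with respect to $(\Delta Q_k)_{a,b}$, i.e., the $(a, b)$th element of the matrix $\Delta Q_k$, and obtain

$$\frac{\partial (M + H_r \Delta Q_k H_r^H)^{-1}}{\partial (\Delta Q_k)_{a,b}} (M + H_r \Delta Q_k H_r^H) + (M + H_r \Delta Q_k H_r^H)^{-1} (H_r e_a e_b^T H_r^H) = 0,$$

where $e_a$ is a vector of all zeros except a single 1 at the $a$th entry. This yields

$$\frac{\partial v^H (M + H_r \Delta Q_k H_r^H)^{-1} v}{\partial (\Delta Q_k)_{a,b}} = -v^H (M + H_r \Delta Q_k H_r^H)^{-1} H_r e_a e_b^T H_r^H (M + H_r \Delta Q_k H_r^H)^{-1} v$$

$$= -\mathrm{tr}\left\{ e_b^T H_r^H (M + H_r \Delta Q_k H_r^H)^{-1} vv^H (M + H_r \Delta Q_k H_r^H)^{-1} H_r e_a \right\}, \tag{24}$$

which yields the result in (23) since (24) is the $(b, a)$th element of the matrix in (23). □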
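
The derivative obtained by differentiating the identity $(M + H_r \Delta Q_k H_r^H)^{-1}(M + H_r \Delta Q_k H_r^H) = I$ can be checked numerically. The sketch below does this for a real-valued example and a diagonal perturbation, for which the derivative with respect to the $a$th diagonal entry at $\Delta Q_k = 0$ is $-([H_r^T M^{-1} v]_a)^2$. The matrix sizes, the random test matrices and the step size are arbitrary assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 6                                # arbitrary sizes for the check
H_r = rng.standard_normal((n, p))
v = rng.standard_normal(n)
M = 0.5 * np.eye(n) + H_r @ np.diag(rng.uniform(0, 1, p)) @ H_r.T + np.outer(v, v)

def g(dq):
    """v^T (M + H_r diag(dq) H_r^T)^{-1} v for a diagonal perturbation dq."""
    return v @ np.linalg.solve(M + H_r @ np.diag(dq) @ H_r.T, v)

u = H_r.T @ np.linalg.solve(M, v)          # H_r^T M^{-1} v
analytic = -u ** 2                         # diagonal of -H_r^T M^{-1} v v^T M^{-1} H_r

eps = 1e-6
numeric = np.array([(g(eps * e) - g(-eps * e)) / (2 * eps) for e in np.eye(p)])

print(np.max(np.abs(numeric - analytic)))
```

The central differences agree with the closed form, including the negative sign that comes from differentiating a matrix inverse.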

B) Outline of the proof of Theorem 2: The proof of the theorem follows the proof of Theorem 2 of [18] and Theorem 1 of [26]. Hence, we mainly focus on the differences.
Suppose we construct $\hat{x}_{\Gamma_i}(t)$, $i = 1, \ldots, m$, and compute the weights

$$c_k(t) = \frac{2^{-C(\Gamma_k)} \exp\{-a \sum_{r=1}^{t-1} [x(r) - \hat{x}_{\Gamma_k}(r)]^2\}}{\sum_{i=1}^{m} 2^{-C(\Gamma_i)} \exp\{-a \sum_{r=1}^{t-1} [x(r) - \hat{x}_{\Gamma_i}(r)]^2\}}, \tag{25}$$

where $0 \le C(\Gamma_i) \le 2K_i - 1$ are certain constants that are used only for proof purposes such that $\sum_{i=1}^{m} 2^{-C(\Gamma_i)} = 1$ [14], and $a$ is a positive constant chosen as in [26]. If we define $\hat{x}(t) = \sum_{k=1}^{m} c_k(t) \hat{x}_{\Gamma_k}(t)$, then it follows from Theorem 1 of [26] that

$$\sum_{t=1}^{n} [x(t) - \hat{x}(t)]^2 \le \sum_{t=1}^{n} [x(t) - \hat{x}_{\Gamma_i}(t)]^2 + O(K_i)$$

for all $i = 1, \ldots, m$. Hence, $\hat{x}(t)$ is the desired $\hat{x}_{ctw}(t)$. However, note that $\hat{x}(t)$ requires the outputs of $m$ algorithms and computes $m$ performance based weights in (25). Nevertheless, in $\hat{x}(t)$ there are only $D + 1$ distinct node predictions $\hat{x}_\rho(t)$, those of the nodes to which $q(t)$ belongs, such that all the weights with the same node prediction can be merged to construct the performance weighting. It is shown in [18] that if one defines certain functions of performance for each node as $A_\rho(t)$, $B_\rho(t)$ that are initialized in (line A) and updated in (line C), (line D), (line E) of Fig. 7, then the corresponding $\hat{x}(t)$ can be written as $\hat{x}(t) = \sum_{k=1}^{D+1} \beta_k(t) \hat{x}_{\mathbf{l}(k)}(t)$, where $\mathbf{l}$ contains the nodes that $q(t)$ belongs to and $\beta_k(t)$ are calculated as shown in (line B) of Fig. 7. Hence, the desired equalizer is given by $\hat{x}_{ctw}(t) = \sum_{k=1}^{D+1} \beta_k(t) \hat{x}_{\mathbf{l}(k)}(t)$, which requires computing $D + 1$ node estimates, updates only $D + 1$ node equalizers at each time $t$ and stores $2^{D+1} - 1$ node weights. This completes the outline of the proof of (12). To get the corresponding result in (13), we define the node predictors as the LS predictors such that the stacked coefficient vector of node $\rho$, $[w_\rho^T(t)\ \ {-f_\rho^T(t)}]^T$, is given by

$$M^{-1}(t-1) p(t-1), \quad M(t-1) = \left( \sum_{r=1}^{t-1} d(r-1) d(r-1)^T s_\rho(r) + \delta I \right), \tag{26}$$

and $p(t-1) = \sum_{r=1}^{t-1} Q(y(r)) d(r-1) s_\rho(r)$, where $d(r) = [y(r)^T\ \bar{x}(r)^T]^T$ and $s_\rho(r)$ is the indicator variable for node $\rho$, i.e., $s_\rho(r) = 1$ if $q(r) \in V_\rho$, otherwise $s_\rho(r) = 0$. The affine predictor in (26) is a least squares predictor that trains only on the observed data $\{y(t)\}$ and $\{\bar{x}(t)\}$ that belongs to that node, i.e., that falls into the region $V_\rho$. Note that the update in (26) can be implemented with $O(1)$ computations using fast inversion methods [27]. The RLS algorithm is shown to achieve the excess loss given in (13) as shown in [28]. □
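
A regularized least squares predictor of the form $M^{-1} p$ can be maintained recursively with a rank-one update of the inverse. The sketch below uses the Sherman-Morrison identity and is meant only to illustrate the idea: the class name `NodeRLS`, the symbol `u` for the stacked coefficient vector and the regularization value are assumptions, and this plain update costs $O(\mathrm{dim}^2)$ per sample rather than using the fast inversion methods cited in the text. A node would call `update` only for the samples that fall into its region.

```python
import numpy as np

class NodeRLS:
    """Recursive regularized least squares for one node of the tree."""

    def __init__(self, dim, delta=1.0):
        self.P = np.eye(dim) / delta      # inverse of (sum d d^T + delta I)
        self.p = np.zeros(dim)            # running sum of target * d
        self.u = np.zeros(dim)            # stacked coefficient vector

    def predict(self, d):
        return self.u @ d

    def update(self, d, target):
        """Add one data vector d with its target (call only if q(t) is in this node)."""
        Pd = self.P @ d
        self.P -= np.outer(Pd, Pd) / (1.0 + d @ Pd)   # Sherman-Morrison identity
        self.p += target * d
        self.u = self.P @ self.p
```

After any number of updates, `u` equals the direct solution of the regularized normal equations for the samples seen by the node.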
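A pseudo-code of the piecewise linear turbo equalizer using context trees:

% 1st iteration:
for $t = 1, \ldots, n$:
    $\hat{x}_{ctw}(t) = w^{(1)T}(t) y(t)$,
    if $t \le T$: $e(t) = x(t) - \hat{x}_{ctw}(t)$,
    elseif $t > T$: $e(t) = Q(\hat{x}_{ctw}(t)) - \hat{x}_{ctw}(t)$, where $Q(\cdot)$ is a quantizer,
    $w^{(1)}(t+1) = w^{(1)}(t) + \mu e(t) y(t)$.
calculate $q(t)$ using the SISO decoder for $t > T$.

% mth iteration:
apply the LBG VQ algorithm to $\{q(t)\}_{t > T}$ to generate $\tilde{q}_k^{(m)}$, $k = 1, \ldots, 2^D$, for the leaves and $\tilde{q}_k^{(m)}$, $k = 2^D + 1, \ldots, 2^{D+1} - 1$, for the internal nodes.
if $m = 2$:
    for $k = 1, \ldots, 2^{D+1} - 1$: $w_k^{(2)}(1) = w^{(1)}(n)$, where $w^{(1)}(n)$ is from the 1st iteration.
if $m \ge 3$:
    for $i = 1, \ldots, 2^{D+1} - 1$: $w_i^{(m)}(1) = w_j^{(m-1)}(n)$, $f_i^{(m)}(1) = f_j^{(m-1)}(n)$, where $j = \arg\min_k \|\tilde{q}_i^{(m)} - \tilde{q}_k^{(m-1)}\|$.
for $k = 1, \ldots, 2^{D+1} - 1$: $A_k(0) = 1$, $B_k(0) = 1$.   (line A)
for $t = 1, \ldots, T$:
    for $i = 1, \ldots, 2^D$:
        $\bar{x}(t) = [I - \mathrm{diag}(\tilde{q}_i^{(m)})]^{1/2} x(t)$   % to consider uncertainty during training
        store in $\mathbf{l}$ the nodes (dark nodes) on the path from the root, $\mathbf{l}(1)$, to the leaf node $i$, $\mathbf{l}(D+1) = i$.
        $\eta_1(t) = 1/2$.
        for $l = 2, \ldots, D + 1$:
            $\eta_l(t) = \frac{1}{2} A_s(t-1) \eta_{l-1}(t)$, where $s$ is the sibling node of $\mathbf{l}(l)$, i.e., $V_s \cup V_{\mathbf{l}(l)} = V_{\mathbf{l}(l-1)}$.
        for $l = 1, \ldots, D + 1$: $\beta_l(t) = \eta_l(t) B_{\mathbf{l}(l)}(t-1) / A_{\mathbf{l}(1)}(t-1)$.
        for $l = D + 1, \ldots, 1$:
            $e_{\mathbf{l}(l)}(t) = x(t) - [w_{\mathbf{l}(l)}^{(m)T}(t) y(t) - f_{\mathbf{l}(l)}^{(m)T}(t) \bar{x}(t)]$,
            $B_{\mathbf{l}(l)}(t) = B_{\mathbf{l}(l)}(t-1) \exp(-c\, e_{\mathbf{l}(l)}^2(t))$, where $c$ is a positive constant,
            if $l = D + 1$: $A_{\mathbf{l}(l)}(t) = B_{\mathbf{l}(l)}(t)$,
            else: $A_{\mathbf{l}(l)}(t) = \frac{1}{2} A_{\mathbf{l}(l),l}(t-1) A_{\mathbf{l}(l),r}(t-1) + \frac{1}{2} B_{\mathbf{l}(l)}(t)$,
            % where $(\mathbf{l}(l), l)$ and $(\mathbf{l}(l), r)$ are the left and right hand children of $\mathbf{l}(l)$, respectively,
            $w_{\mathbf{l}(l)}^{(m)}(t+1) = w_{\mathbf{l}(l)}^{(m)}(t) + \mu e_{\mathbf{l}(l)}(t) y(t)$, $f_{\mathbf{l}(l)}^{(m)}(t+1) = f_{\mathbf{l}(l)}^{(m)}(t) - \mu e_{\mathbf{l}(l)}(t) \bar{x}(t)$.
for $t = T + 1, \ldots, n$:
    $i = \arg\min_k \|q(t) - \tilde{q}_k^{(m)}\|$, $k = 1, \ldots, 2^D$.
    find the nodes that $i$ belongs to and store them in $\mathbf{l}$, from the root, $\mathbf{l}(1)$, to the leaf node $i$, $\mathbf{l}(D+1) = i$.
    $\eta_1(t) = 1/2$,
    for $l = 2, \ldots, D + 1$:
        $\eta_l(t) = \frac{1}{2} A_s(t-1) \eta_{l-1}(t)$.
    for $l = 1, \ldots, D + 1$: $\beta_l(t) = \eta_l(t) B_{\mathbf{l}(l)}(t-1) / A_{\mathbf{l}(1)}(t-1)$.   (line B)
    $\hat{x}_{ctw}(t) = \sum_{l=1}^{D+1} \beta_l(t) [w_{\mathbf{l}(l)}^{(m)T}(t) y(t) - f_{\mathbf{l}(l)}^{(m)T}(t) \bar{x}(t)]$.   (line F)
    for $l = D + 1, \ldots, 1$:
        $e_{\mathbf{l}(l)}(t) = Q(\hat{x}_{ctw}(t)) - [w_{\mathbf{l}(l)}^{(m)T}(t) y(t) - f_{\mathbf{l}(l)}^{(m)T}(t) \bar{x}(t)]$,
        $B_{\mathbf{l}(l)}(t) = B_{\mathbf{l}(l)}(t-1) \exp(-c \|e_{\mathbf{l}(l)}(t)\|^2)$,   (line C)
        if $l = D + 1$: $A_{\mathbf{l}(l)}(t) = B_{\mathbf{l}(l)}(t)$,   (line D)
        else: $A_{\mathbf{l}(l)}(t) = \frac{1}{2} A_{\mathbf{l}(l),l}(t-1) A_{\mathbf{l}(l),r}(t-1) + \frac{1}{2} B_{\mathbf{l}(l)}(t)$,   (line E)
        $w_{\mathbf{l}(l)}^{(m)}(t+1) = w_{\mathbf{l}(l)}^{(m)}(t) + \mu e_{\mathbf{l}(l)}(t) y(t)$, $f_{\mathbf{l}(l)}^{(m)}(t+1) = f_{\mathbf{l}(l)}^{(m)}(t) - \mu e_{\mathbf{l}(l)}(t) \bar{x}(t)$.   (line G)
calculate $\{q(t)\}_{t > T}$ using the SISO decoder.
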
Fig. 7. A context tree based turbo equalization. This algorithm requires $O(D(M + N))$ computations.

Fig. 8. (a) Ensemble averaged weight vector for the DD LMS algorithm in the first turbo iteration, where $\mu = 0.001$, $T = 1024$ and data length 5120. (b) Convolution of the trained weight vector of the DD LMS algorithm at sample 5120 and the channel $h$. [Panel title: "Equalizer weights for decision directed LMS - 1st iter."; horizontal axis: samples.]

Fig. 9. BERs for an ordinary DD LMS algorithm, a CTW equalizer with $D = 2$ and tree given in Fig. 5, the piecewise equalizer with the finest partition, i.e., $\hat{x}_{\Gamma_5}(t)$, where $\mu = 0.001$, $N_1 = 9$, $N_2 = 5$, $N + M - 1 = 19$.

Fig. 10. Ensemble averaged combined weight vector for the CTW equalizer over 7 turbo iterations. Here, we have $\mu = 0.001$, $T = 1024$, data length 5120 and 7 turbo iterations. [Panel title: "Equalizer weights for CTW"; horizontal axis: samples.]

Fig. 11. (a) The distribution of the weights, i.e., values assigned to $\beta_i(t)$, $i = 1, 2, 3$, such that $\beta_i(t)$ belongs to the $i$th level. (b) Time evolution of $A_\rho(t)$, which represents the performance of the linear equalizer assigned to node $\rho$. Note that at each iteration, we reset $A_\rho(t)$ since a new tree is constructed using clustering.

Fig. 12. BERs corresponding to CTW equalizers of depth $D = 1$, $D = 2$ and $D = 3$.

Fig. 13. BERs for an ordinary DD LMS algorithm, a CTW equalizer with $D = 2$ and tree given in Fig. 5, the piecewise equalizer with the finest partition, i.e., $\hat{x}_{\Gamma_5}(t)$, where $\mu = 0.001$, $N_1 = 9$, $N_2 = 5$, $N + M - 1 = 19$.